Abstract. In social network research, the study of communities is a flourishing
area that has led to insightful solutions to many practical challenges. Despite
the ubiquity of communities in social networks and their ability to characterize
users and links, they have not been explicitly considered in information
diffusion models. Previous studies on social networks found that links between
communities function differently from those within communities. However, no
information diffusion model has yet considered how community structure affects
the diffusion process.

Motivated by this gap, we conduct exploratory studies on the effects of
communities in information diffusion processes. Our observations on community
effects can help to solve many tasks in the study of information diffusion. As
an example, we show their application to one of the most important problems in
information diffusion: the influence maximization problem. We propose a
community-based fast influence (CFI) model which leverages the community effects
on the diffusion of information and provides an effective approximate algorithm
for the influence maximization problem.


1  Introduction 

For many years, community study has been one of the most active topics in social
network research. Studies in this area offer insightful solutions to many classic
problems of social network research, such as network evolution [14], recommender
systems [19], and expert finding [2]. Communities can also be helpful for
studying the diffusion of information in social networks. Previous studies found
that links between communities function differently from those within
communities: friends in the same community have stronger links, but the weaker
links between friends in different communities are crucial to the diffusion of
novel information, because these links provide more useful information to people
[1,7,12].

Some key problems in the study of information diffusion have proved difficult to
solve with traditional information diffusion methods. Studies on the community
structure of social networks may bring new ideas for solving these problems. For
example, one of the key problems, the influence maximization problem for the
independent cascade model, has been proved to be NP-hard [15]. By considering the
effects of communities on the diffusion of information, we can come up with
intuitive heuristics that solve this problem more efficiently. For example, we
may utilize community homophily to quickly estimate the influence of users. We
may also select seed nodes from different communities to minimize the overlap
and maximize the influence.

However, few existing works have explicitly studied the effects of communities
on the diffusion of information, or used these effects to solve diffusion-related
problems. Motivated by this gap, we introduce the first exploratory study on the
effects of communities on information diffusion processes. By analyzing
real-world datasets, we study the diffusion of information with communities. We
first observe the action homophily of communities, and then introduce the
concept of role-based homophily of communities, which consists of influencee
role homophily and influencer role homophily. We discover that the role-based
homophily is significantly stronger than the action homophily.

Our findings on community effects can lead to insightful solutions to many
problems in information diffusion studies. As an example application of these
findings, we propose an approximate solution for the influence maximization
problem. We design a community-based fast influence (CFI) model based on the
influencee role homophily of communities. The CFI model applies a community
clustering method to social networks, and aggregates users’ roles as
influencees. An influence maximization algorithm based on the CFI model can
efficiently select seed nodes to maximize the influence. The main contributions
are summarized as follows:

1. We conduct quantitative analyses on real-world datasets to explore the effects
of communities on the diffusion of information.

2. We obtain valuable findings about community effects from these quantitative
analyses. We introduce the concept of role-based homophily of communities. These
findings can bring new insights to the study of information diffusion.

3. We show an example application of our findings to the influence maximization
problem. We propose a community-based fast influence (CFI) model, and an
efficient approximate influence maximization algorithm based on that model.

2  Related  work 

Information Diffusion Problem. Several models have been proposed for information
diffusion processes. The independent cascade (IC) model and its variants are the
most widely used information diffusion models [10,15,16,21]. The basic idea of
the IC model is that if a node in a social network becomes active, it can make
each of its neighbors active with some probability, and for each node the
attempts of its neighbors to activate it are independent. The influence
maximization problem has been defined for the IC model and a few other
information diffusion models [15]. Given an IC model, the problem is to select a
seed set with k nodes so that the expected number of active nodes is maximized.
This problem has been proved to be NP-hard. The first solution to it is a greedy
algorithm that repeatedly invokes a computationally expensive sampling method
[15]. Heuristic algorithms and optimized versions of the greedy algorithm have
been proposed in previous works [3,4,17]. The work in [22] proposed a heuristic
algorithm that finds influencers from communities. Unlike that work, our
proposed model is built on observations of real data and based on a
substantially different idea. A recent work [8] defined a group-based version of
the influence maximization problem. The predefined groups studied in that work
are not conceptually equivalent to the communities studied in this paper.

Community Detection. Community detection in social networks has been studied for
years, and a variety of algorithms have been proposed. A good survey is
available in [18]. We do not discuss the full range of existing community
detection methods, only those related to our work in this paper.
Modularity-based methods are a major class of community detection methods. Among
these methods, the fast greedy method [5] is frequently used for community
detection on large-scale networks. In [20], Rosvall et al. proposed the infomap
method. Substantially different from modularity-based methods, the infomap
method is based on flows carried by networks [20]. The SHRINK algorithm in [13]
is another algorithm that is related to our work. It is a parameter-free
hierarchical network clustering algorithm that combines the advantages of
density-based clustering and modularity optimization methods. The work in [23]
utilized social influence modeling methods in the detection of communities.

3  Preliminary 

3.1  Notations 

A social network $G = \{V, E\}$ is a directed graph with a node set $V$ and
an edge set $E$. A node $v_i \in V$ represents a user in the social network,
while a directed edge $e_{ij} \in E$ represents a link from $v_i$ to $v_j$.

A community $C$ in the social network $G$ is a subset of the node set $V$.
We consider non-overlapping communities in this paper. In other words, we
consider the partition of $V$ into a set of communities
$\mathcal{C} = \{C_1, \ldots, C_m\}$. Each user in the network belongs to exactly
one of the communities in $\mathcal{C}$. Given a graph $G$, a community detection
algorithm divides $G$ into a set of communities $\mathcal{C}$. Many different
community detection algorithms exist. Generally, a good community detection
algorithm finds a partition such that (1) each community is a relatively
independent compartment of the graph, and (2) nodes in the same community tend
to have dense links between each other.

We follow the definition of the information diffusion process in the IC model
[15]. An information diffusion process starts with a set of seed nodes that are
initially active. Active nodes can activate their out-neighbors in the social
network. Once a node is activated, it becomes active and can never become
inactive again. In real applications, information diffusion processes often
cannot be directly observed. For example, we may observe that a person got
infected by influenza, but we do not know from whom he got infected. We define a
cascade $O = \{(v_1, t_1), \ldots, (v_m, t_m)\}$ as the set of user actions during
an information diffusion process. An action $(v_i, t_i)$ in $O$ represents that
the user $v_i$ becomes active at time $t_i$. In this paper, we focus on the
scenario in which the information diffusion processes are not directly observed,
but a set of cascades is observed.
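
To make the diffusion process and the cascade notation concrete, the following
is a minimal Python sketch of one IC-style simulation that returns a cascade as
a list of (user, time) actions. The adjacency format `out_edges`, the function
name `simulate_ic`, and the use of discrete rounds as activation times are
assumptions made for illustration only.

```python
import random


def simulate_ic(out_edges, seeds, rng):
    """Simulate one diffusion under the IC model.

    out_edges: dict mapping a node to a list of (out_neighbor, probability).
    seeds:     iterable of initially active nodes (activation time 0).
    rng:       a random.Random instance, passed in for reproducibility.
    Returns the cascade as a list of (node, activation_time) pairs.
    """
    active_time = {s: 0 for s in seeds}
    frontier = list(active_time)
    step = 0
    while frontier:
        step += 1
        newly_active = []
        for u in frontier:
            # Each newly active node gets a single, independent attempt
            # to activate each of its still-inactive out-neighbors.
            for v, prob in out_edges.get(u, []):
                if v not in active_time and rng.random() < prob:
                    active_time[v] = step
                    newly_active.append(v)
        frontier = newly_active
    # Active nodes never become inactive, so the dict is the whole cascade.
    return sorted(active_time.items(), key=lambda action: (action[1], action[0]))
```

In the setting of this paper only the returned (user, time) pairs would be
observed, not which neighbor caused each activation.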


3.2  Datasets 

Foursquare [9].  In  this  dataset,  nodes  represent  users  of  the  Foursquare  website, 
while  edges  represent  friendship  relations.  Actions  are  defined  by  check-ins  of 
users.  Each  cascade  corresponds  to  a  location.  When  a  user  checks  in  at  a  location 
for  the  first  time,  she  becomes  active  for  the  corresponding  cascade.  This  dataset 
contains  18,107  users,  245,034  friendship  relations,  and  476,482  actions  of  43,063 
cascades. 

DBLP.  In  this  dataset,  nodes  represent  authors,  while  edges  represent  co-author 
relations.  We  extract  a  subgraph  of  the  DBLP  network  with  authors  and  papers 
in  the  areas  of  data  mining  and  machine  learning.  We  define  cascades  by  terms 
(defined  by  bi-grams)  in  the  titles  of  papers.  When  an  author  has  a  paper  with  a 
certain  term  in  the  title  for  the  first  time,  he  becomes  active  for  the  corresponding 
cascade.  This  dataset  contains  6,896  users,  111,044  friendship  relations,  and 
1,655,778  actions  of  162,904  cascades. 

4  Observations 

In this section, we explore the community effects on information diffusion
processes via analyses on real-world datasets. We first identify communities in
social networks, and then study cascades with respect to these communities.

4.1  Identifying  Communities  for  Information  Diffusion 

Communities in social networks can be defined in many different ways. To
understand the effects of communities on information diffusion in general, we
apply two different community detection algorithms to the two networks, and
conduct community effect analyses for both algorithms.

The two community detection methods that we use to identify communities are the
fast greedy (FG) method [5] and the infomap (IM) method [20]. The FG method is
based on the well-adopted idea of modularity maximization, while the IM method
is a flow-based method, which is essentially different from the modularity
maximization methods. We choose these two methods because (1) both are widely
used community detection methods that have proved to be efficient and accurate,
and (2) they are based on substantially different ideas. Both methods are
implemented in the igraph network analysis package [6].

Fig. 1. Distribution of similarity between actions of user pairs


4.2  Action  Homophily  of  Communities 

We  first  look  into  the  effects  that  communities  have  on  the  actions  of  users.  We 
construct  a  vector  for  each  user  to  keep  the  action  information  of  that  user,  and 
then  compare  the  vectors  between  pairs  of  users.  We  check  whether  the  users 
who  belong  to  the  same  community  are  more  likely  to  have  similar  actions. 

Given a set of cascades $\mathcal{O} = \{O_1, \ldots, O_m\}$, we define an action
vector $\mathbf{a}_i$ for each user $v_i$, where
$\mathbf{a}_i = (a_{i1}, \ldots, a_{im})$. If the user $v_i$ has an action in the
cascade $O_j$, we set $a_{ij}$ to 1. Otherwise, we set $a_{ij}$ to 0. For each
pair of users $v_i$ and $v_j$, we calculate the cosine similarity between the
action vectors $\mathbf{a}_i$ and $\mathbf{a}_j$, and then study the distribution
of similarity. We consider three different cases here: (1) there is an edge
$e_{ij}$ between $v_i$ and $v_j$, and $v_i$ and $v_j$ belong to the same
community; (2) there is an edge $e_{ij}$ between $v_i$ and $v_j$, but $v_i$ and
$v_j$ belong to different communities; (3) $v_i$ and $v_j$ are an arbitrary pair
of nodes, which may or may not have an edge between them. For each case, we plot
the distribution of similarity, and check whether there is any difference
between the distributions.
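
As an illustration of this comparison, the sketch below builds binary action
vectors from a list of cascades and computes the cosine similarity for one pair
of users. The cascade format (a list of (user, time) pairs) and the function
names are assumptions for this example, and returning 0 for a user with no
actions is a convention chosen here rather than one stated in the text.

```python
import math


def action_vectors(cascades, users):
    """Return {user: [a_1, ..., a_m]}, with a_j = 1 if the user acts in cascade j."""
    vectors = {u: [0] * len(cascades) for u in users}
    for j, cascade in enumerate(cascades):
        for user, _time in cascade:
            vectors[user][j] = 1
    return vectors


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)
```

Grouping the resulting similarities by the three cases (linked and same
community, linked and different communities, arbitrary pair) would give the
three distributions that are compared.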

Figure 1 shows the distributions of similarity in the two datasets, with two sets
of communities in each dataset. In each setting, we observe a similar discrepancy
among the three distributions: compared with linked pairs in different
communities, linked pairs in the same community have larger similarity; compared
with arbitrary pairs, linked pairs have much larger similarity. Intuitively,
friends in the same community tend to have stronger links between each other,
and they have more chances to influence each other indirectly via common
friends. The results are consistent across the two community detection methods.


4.3  Role-Based  Homophily  of  Communities 

We have observed the action homophily of communities. However, although the
similarity between linked pairs in the same community is relatively larger than
the similarity in the other two cases, it is still quite small (typically less
than 0.3). In this section, we introduce the role-based homophily of communities,
and show that the role-based homophily is more significant than the action
homophily.

With a set of cascades $\mathcal{O}$, we build a support matrix $S$ for the
influence between users in the social network. The element at the $i$-th row and
the $j$-th column of the matrix $S$ is the number of potential influences from
the user $v_i$ to the user $v_j$. We say there is a potential influence from
$v_i$ to $v_j$ if both of them have actions in the same cascade, and the time of
$v_i$'s action is earlier than the time of $v_j$'s action. Formally, it is
defined as
$s_{ij} = |\{O_k \in \mathcal{O} \mid v_i, v_j \in V(O_k) \wedge t_i^{O_k} < t_j^{O_k}\}|$,
where $V(O_k)$ is the set of users that have an action in the cascade $O_k$, and
$t_i^{O_k}$ is the time of $v_i$'s action in the cascade $O_k$.

Fig. 2. Distribution of similarity between influencer feature vectors of user pairs

We define $\mathbf{s}_{i*}$, the $i$-th row of $S$, as the influencer feature
vector of $v_i$, and $\mathbf{s}_{*i}$, the $i$-th column of $S$, as the
influencee feature vector of $v_i$. The influencer feature vector
$\mathbf{s}_{i*}$ captures the influence that $v_i$ has on other users in the
social network, while the influencee feature vector $\mathbf{s}_{*i}$ captures
the influence from other users to the user $v_i$.
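
The support matrix and the two role-based feature vectors can be sketched in a
few lines of Python. The dense list-of-lists representation, the cascade format
(a list of (user, time) pairs with each user appearing at most once per
cascade), and the function names are assumptions for illustration; a sparse
representation would be preferable for networks of realistic size.

```python
def support_matrix(cascades, users):
    """Count, for every ordered pair (v_i, v_j), the cascades in which both act
    and v_i acts strictly earlier than v_j."""
    index = {u: n for n, u in enumerate(users)}
    size = len(users)
    S = [[0] * size for _ in range(size)]
    for cascade in cascades:
        for vi, ti in cascade:
            for vj, tj in cascade:
                if ti < tj:
                    S[index[vi]][index[vj]] += 1
    return S


def influencer_vector(S, i):
    """Row i of S: potential influence of user i on every other user."""
    return list(S[i])


def influencee_vector(S, i):
    """Column i of S: potential influence of every other user on user i."""
    return [row[i] for row in S]
```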

Similar to what we did in Section 4.2, we calculate the cosine similarity
between the influencer/influencee feature vectors, and compare the
distributions. Figures 2 and 3 show the comparison of distributions for the
influencer feature vector and the influencee feature vector, respectively. As in
Figure 1, compared with the other two cases, the similarity is larger when users
are linked and in the same community. There are a few interesting new
observations:

First, the distributions of similarity between influencer/influencee feature
vectors (Figures 2 and 3) show a significantly larger discrepancy than the
distributions of similarity between action vectors (Figure 1). This observation
suggests that for users in the same community the role-based homophily is much
stronger than the action homophily. The effect of community on the information
diffusion process is better reflected by the roles that users play in the
process than by its results (whether a user is active or inactive for a
cascade).

Second, for friends in the same community, the similarity of influencer and
influencee feature vectors (typically larger than 0.5) is larger than the
similarity of action vectors (typically less than 0.3). This suggests that
aggregating the influencer/influencee feature vectors of users in the same
community is feasible without significant loss of accuracy.

Third, the influencee-based homophily is more significant than the
influencer-based homophily, especially for the FG algorithm. This is easy to
understand through the following example: professors and students in the same
research lab have similar behaviors as influencees, because when a cascade
reaches anyone in the lab, it is very likely that the cascade will reach everyone
in the lab quickly, but professors are probably much stronger influencers than
students.

Fig. 3. Distribution of similarity between influencee feature vectors of user pairs


5  Community-Based  Fast  Influence  Model 

Based  on  the  observations  in  Section  4,  we  are  able  to  design  an  efficient  influence 
model  which  makes  use  of  the  community  effects.  The  community-based  fast 
influence  (CFI)  model  we  propose  in  this  section  is  an  approximate  model  for 
the  IC  model.  The  whole  framework  has  three  components,  namely  influence 
decoupling,  community  detection,  and  influence  maximization. 


5.1  Influence  Decoupling 

An intuitive way to construct an approximate information diffusion model based
on community effects is to consider each community as a “super-node” and let
information propagate through “super-edges” between “super-nodes”. The
coarse-grained information diffusion model in [8] is based on a similar idea.

Although this intuitive model is simple and seems reasonable, it may not work
for our task here. When we consider a community as a “super-node”, we have to
aggregate users’ roles as influencers as well as users’ roles as influencees.
This may cause a problem: the influence maximization problem requires us to
determine how influential each user is and find the set of seed nodes that
maximizes the influence. When we aggregate the roles of users as influencers, we
lose the information necessary for solving the influence maximization problem.

To avoid this problem, the CFI model considers the roles of users as influencers
and influencees separately. Specifically, we split each node $v_i$ in the
network $G$ into an influencer node $v_i^{out}$ and an influencee node
$v_i^{in}$, and transform the network into a bipartite graph $G_b$. In the graph
$G_b$, there is an edge from $v_i^{out}$ to $v_j^{in}$ if and only if the edge
$e_{ij}$ exists in the original graph $G$. We call this transformation from the
original network $G$ to the bipartite graph $G_b$ influence decoupling. The left
part of Figure 4(a) shows an example of influence decoupling: each edge from a
node $v_i$ to a node $v_j$ in the original graph $G$ has a corresponding edge
from $v_i^{out}$ to $v_j^{in}$ in $G_b$. The result of influence decoupling is
that we can apply the community-based aggregation to the influencee nodes only.
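
Influence decoupling is a purely structural transformation, so it can be
sketched directly. In the snippet below, the tuple encoding `(node, "out")` and
`(node, "in")` for the two copies of a node and both function names are
assumptions made for this example.

```python
def decouple(edges):
    """Turn directed edges (i, j) into bipartite edges from the influencer copy
    of i to the influencee copy of j."""
    return {((i, "out"), (j, "in")) for i, j in edges}


def direct_influencees(bipartite_edges, seed):
    """Users whose influencee copy is reachable from the seed's influencer copy.

    Influencee copies have no outgoing edges in the decoupled graph, so nothing
    beyond the seed's direct out-neighbors can be reached.
    """
    return {dst[0] for src, dst in bipartite_edges if src == (seed, "out")}
```

For the chain 1 -> 2 -> 3, node 1 reaches only node 2 in the decoupled graph,
which illustrates why a further aggregation step over influencee nodes is
needed to recover influence beyond direct out-neighbors.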

If we apply the original IC model directly to the decoupled graph $G_b$, we will
end up with cascades with only two levels, i.e., only the nodes that are direct
out-neighbors of the seed nodes can become active. This problem can be
approximately solved by the community detection and the aggregation of
influencee nodes. Instead of limiting influence to direct out-neighbors, the CFI
model allows users to have influence on the communities that their direct
out-neighbors belong to. Notice that we do omit the indirect influence from a
user to the nodes that are neither his out-neighbors nor in the same communities
as his out-neighbors. This is a trade-off between accuracy and efficiency, but
the loss of accuracy is small: the influence between nodes in different
communities is smaller than the influence between nodes in the same community,
and indirect influence is generally very small. We will also show by experiment
that the CFI model is a good enough approximation to the original IC model.

Fig. 4. (a) Inference of the CFI model, (b)-(c) Influence of different sizes of
seed set on (b) DBLP and (c) Foursquare


5.2  Identifying  Communities 

We now discuss the community detection algorithm. As discussed in the last
section, users in the same community should be similar influencees. To identify
communities so that users in the same community are similar as influencees, we
design an agglomerative clustering algorithm. It starts with clusters of single
users, and iteratively merges clusters based on the similarity between clusters.
As shown in Figure 4(a), the clustering procedure is conducted on the original
graph $G$, but the similarity is defined by users’ roles as influencees, and the
communities detected by the algorithm are finally applied to the influencee
nodes in the decoupled graph $G_b$.

Similarity. The similarity between two clusters is defined as the cosine
similarity between their incident influence probability vectors. Let
$p_{i \to j}$ be the probability that $v_i$ influences $v_j$ directly or
indirectly (i.e., the probability that $v_j$ becomes active if $v_i$ is the
single seed node). For a cluster $C = \{v_{i_1}, \ldots, v_{i_{n_C}}\}$, we define
the influence that user $v_j$ has on $C$ as:

$$q_{j \to C} = \begin{cases} \frac{1}{n_C} \sum_{k=1}^{n_C} p_{j \to i_k} & \text{if } e_{j i_k} \in E \text{ for some } v_{i_k} \in C \\ 0 & \text{otherwise,} \end{cases}$$

where $n_C$ is the number of nodes in the cluster $C$.

We define the incident influence probability vector of a community $C$ as
$\mathbf{q}_C = (q_{1 \to C}, \ldots, q_{n \to C})$, and the similarity between two
clusters $C_1$ and $C_2$ as
$sim(C_1, C_2) = \mathbf{q}_{C_1} \cdot \mathbf{q}_{C_2} / (\|\mathbf{q}_{C_1}\| \|\mathbf{q}_{C_2}\|)$.
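
The cluster similarity can be illustrated with a short Python sketch. It assumes
the influence probabilities are given as a dict `p` keyed by ordered user pairs
(missing pairs are treated as 0) and the edge set as a set of ordered pairs;
these data structures and the function names are hypothetical choices for this
example.

```python
import math


def q_influence(p, j, cluster, edges):
    """Influence of user j on a cluster: the mean of p[j -> i] over members i,
    or 0 if j has no edge into the cluster."""
    if not any((j, i) in edges for i in cluster):
        return 0.0
    return sum(p.get((j, i), 0.0) for i in cluster) / len(cluster)


def incident_vector(p, users, cluster, edges):
    """Incident influence probability vector of a cluster, one entry per user."""
    return [q_influence(p, j, cluster, edges) for j in users]


def cluster_similarity(p, users, c1, c2, edges):
    """Cosine similarity between the incident vectors of two clusters."""
    q1 = incident_vector(p, users, c1, edges)
    q2 = incident_vector(p, users, c2, edges)
    dot = sum(a * b for a, b in zip(q1, q2))
    n1 = math.sqrt(sum(a * a for a in q1))
    n2 = math.sqrt(sum(b * b for b in q2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Two clusters score high when largely the same users influence both of them,
which matches the goal of grouping users who are similar as influencees.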
Influence Probability Estimation. Similar to the learning algorithm for the IC
model in [10], given a set of cascades $\mathcal{O}$, we estimate the influence
probability $p_{i \to j}$ from cascades by the following equation:

$$\hat{p}_{i \to j} = \frac{s_{ij}}{s_i} = \frac{|\{O_k \in \mathcal{O} \mid v_i, v_j \in V(O_k) \wedge t_i^{O_k} < t_j^{O_k}\}|}{|\{O_k \in \mathcal{O} \mid v_i \in V(O_k)\}|}$$

Since the fact that $v_i$ becomes active earlier than $v_j$ does not necessarily
imply that $v_i$ directly or indirectly influences $v_j$, $\hat{p}_{i \to j}$ is
not an unbiased estimator of $p_{i \to j}$. However, it is still a good enough
estimator for the CFI model.
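
A minimal sketch of this counting estimator is given below. The cascade format
(a list of (user, time) pairs in which each user appears at most once) and the
function name are assumptions for illustration; pairs that never co-occur in the
required order are simply absent from the returned dict.

```python
from collections import Counter


def estimate_influence_probabilities(cascades):
    """Estimate p[i -> j] as (#cascades where i acts before j) / (#cascades where i acts)."""
    pair_counts = Counter()   # numerator: i and j both act, i strictly earlier
    user_counts = Counter()   # denominator: cascades in which i acts
    for cascade in cascades:
        times = dict(cascade)
        for vi, ti in times.items():
            user_counts[vi] += 1
            for vj, tj in times.items():
                if ti < tj:
                    pair_counts[(vi, vj)] += 1
    return {
        (vi, vj): count / user_counts[vi]
        for (vi, vj), count in pair_counts.items()
    }
```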

Community Detection and Influence Aggregation. The purpose of community
detection in CFI model learning is to aggregate users who play similar roles as
influencees, while keeping the accuracy of the original IC model. To serve this
purpose, we adopt a community detection strategy that is similar to the
algorithm in [13]. By iteratively merging clusters into larger ones, we get a
sequence of super-graphs $G_0, G_1, G_2, \ldots$. Each node in these super-graphs
corresponds to a cluster. The algorithm starts with graph $G_0$, in which each
cluster contains a single user. In each step $t$, we find from $G_t$ connected
subgraphs that contain similar nodes, and merge these subgraphs to generate a
new super-graph $G_{t+1}$. We repeat these steps until the similarity between
any two neighbors in $G_t$ is below a threshold $\theta$.

Let $\mathcal{C}^{(t)} = \{C_1, \ldots, C_{m(t)}\}$ be the set of clusters at the
$t$-th iteration. We say two clusters $C_1$ and $C_2$ are neighbors if there
exist $v_i \in C_1$ and $v_j \in C_2$ such that edge $e_{ij}$ or $e_{ji}$ exists.
For a pair of connected clusters, we say they are a mutually most similar pair
(ms-pair) with similarity $\varepsilon$ (denoted by
$C_1 \leftrightarrow_\varepsilon C_2$), if
$\varepsilon = sim(C_1, C_2) = \max_{C_i \in N(C_1)} sim(C_1, C_i) = \max_{C_i \in N(C_2)} sim(C_2, C_i)$,
where $N(C_i)$ is the set of neighbors of $C_i$.

We define an ms-subgraph as a maximal connected subgraph of $G_t$ whose nodes
are connected by ms-pairs. Formally, a graph $D$ is an ms-subgraph of $G_t$ with
similarity $\varepsilon$ if and only if (1) for any two nodes $C_i, C_j \in D$,
there exists a path $\langle C_i, C_1, \ldots, C_k, C_j \rangle$ in $D$ such that
$C_i \leftrightarrow_\varepsilon C_1, C_1 \leftrightarrow_\varepsilon C_2, \ldots, C_{k-1} \leftrightarrow_\varepsilon C_k, C_k \leftrightarrow_\varepsilon C_j$;
(2) for any nodes $C_i \in D$ and $C_j \notin D$, $C_i$ and $C_j$ are not an
ms-pair. By this definition, the graph can be partitioned into ms-subgraphs
(some ms-subgraphs may contain only a single node). By merging ms-subgraphs into
new nodes, the original super-graph can be reduced to a smaller super-graph.

At iteration $t$, we first find all the ms-subgraphs of $G_t$, and then merge
each ms-subgraph $D$ that contains more than one node and has similarity
$\varepsilon > \theta$ into a new node. The new node is a neighbor of any node
that was a neighbor of any node in $D$, and the similarity between the new node
and its neighbors needs to be recalculated. The algorithm stops when the
similarity between every pair of linked nodes is less than the threshold
$\theta$, and the clusters at that point are taken as communities.
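
One merging iteration can be sketched as follows. This is an illustrative
reading of the procedure rather than a reference implementation: the neighbor
map, the symmetric similarity callback `sim`, the union-find grouping, and the
function names are all assumptions, and cluster ids are assumed to be orderable.
The sketch also relies on `sim` returning exactly equal values for equal
similarities, since ms-pairs are detected by equality with the maximum.

```python
def find_ms_pairs(neighbors, sim):
    """Return (a, b, similarity) for every mutually most similar neighbor pair.

    neighbors: dict mapping a cluster id to the set of its neighbor ids.
    sim:       symmetric function (a, b) -> similarity.
    """
    best = {
        c: max((sim(c, n) for n in nbrs), default=None)
        for c, nbrs in neighbors.items()
    }
    pairs = []
    for c, nbrs in neighbors.items():
        for n in nbrs:
            if c < n:  # visit each undirected pair once
                s = sim(c, n)
                if s == best[c] and s == best[n]:
                    pairs.append((c, n, s))
    return pairs


def ms_subgraphs(neighbors, sim, theta):
    """Group clusters linked by ms-pairs with similarity above theta.

    Each returned group would be merged into one new cluster; singleton groups
    are clusters left unchanged in this iteration.
    """
    parent = {c: c for c in neighbors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, s in find_ms_pairs(neighbors, sim):
        if s > theta:
            parent[find(a)] = find(b)
    groups = {}
    for c in neighbors:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())
```

After merging, the neighbor map and the similarities involving the new clusters
would be recomputed, and the step repeated until no group has more than one
member.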

5.3  CFI-Based  Influence  Maximization  Algorithm 

In this subsection, we show how we can design a CFI-based algorithm for
influence maximization in the IC model. The influence maximization problem is
defined as follows: given an IC model and an integer $k > 0$, find a set of $k$
seed nodes so that the influence of the seed nodes is maximized. The standard
method to solve this problem is a greedy algorithm [15]. It starts by finding
one seed node that maximizes the influence, and then adds a second node to the
seed node set so that the increase of influence is maximized. In this way, it
repeatedly adds nodes to the seed node set until it has $k$ seed nodes. This
greedy algorithm is very time-consuming, because in each step it uses the Monte
Carlo method to evaluate all the remaining nodes. Optimized versions of the
greedy algorithm have been proposed in [17] and [11], and heuristic algorithms
have been proposed in [3]. These algorithms also use sampling for the evaluation
of nodes. We can get a new heuristic algorithm based on the CFI model. This new
heuristic algorithm does not involve random sampling, so it is faster than the
existing algorithms, especially when the number of seed nodes $k$ is large.

The CFI-based influence maximization algorithm also adopts a greedy framework.
In each step $t$, the node that maximizes the influence increase is selected.
The problem is how to estimate the influence increase using the CFI model. When
$t = 1$, the problem reduces to estimating the influence of each single node.
Let $\mathcal{C} = \{C_1, \ldots, C_m\}$ be the set of communities in the CFI
model. We estimate the influence of a user $v_i$ as
$Inf(\{v_i\}) = \sum_{C \in \mathcal{C}} n_C q_{i \to C}$, where $n_C$ is the
number of users in the community $C$.

Once we select the node with the greatest influence to be the first seed node, we
cannot simply select the second most influential node to be the second seed
node, because the nodes activated by the first node and the second node may
overlap. We need to deduct the number of nodes that have already been activated
by the first node. To do that, we decrease the number of nodes in each community
by the estimated influence of the first node $v_{i_1}$. Formally, we let
$n_C^1 = n_C - n_C q_{i_1 \to C}$, which is the expected number of nodes in the
community $C$ that are not activated by the influence of $v_{i_1}$, and then we
select the second node $v_{i_2}$ by maximizing the increase of influence:
$\Delta Inf(v_{i_2}) = Inf(\{v_{i_1}, v_{i_2}\}) - Inf(\{v_{i_1}\}) = \sum_{C \in \mathcal{C}} n_C^1 q_{i_2 \to C}$.
For $t = 3, \ldots, k$, we repeat the above step to select
$v_{i_3}, \ldots, v_{i_k}$. Generally, we select $v_{i_t}$ by maximizing
$\sum_{C \in \mathcal{C}} n_C^{t-1} q_{i_t \to C}$, where
$n_C^{t-1} = n_C^{t-2} - n_C^{t-2} q_{i_{t-1} \to C}$.
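
The selection rule can be sketched in Python as below. The nested dict `q`
(user, then community, then the user's influence on that community), the dict
`sizes` of community sizes, the alphabetical tie-breaking, and the function name
`cfi_greedy` are assumptions for this example; the sketch shows the
discounting of community sizes, not the full learning pipeline.

```python
def cfi_greedy(q, sizes, k):
    """Greedily pick up to k seeds by estimated marginal influence.

    q:     dict user -> dict community -> influence of the user on the community.
    sizes: dict community -> number of users in the community.
    """
    remaining = {c: float(n) for c, n in sizes.items()}  # not-yet-activated nodes
    candidates = set(q)
    seeds = []
    for _ in range(min(k, len(candidates))):
        def gain(user):
            # Estimated increase of influence if `user` is added as a seed.
            return sum(remaining[c] * qc for c, qc in q[user].items())

        best = max(sorted(candidates), key=gain)  # sorted() fixes tie-breaking
        seeds.append(best)
        candidates.remove(best)
        # Deduct the nodes expected to be activated by the newly chosen seed.
        for c, qc in q[best].items():
            remaining[c] -= remaining[c] * qc
    return seeds
```

With `sizes = {"A": 10, "B": 10}` and users `u` and `v` that both mainly
influence community `A`, the discount makes a user influencing `B` more
attractive as the second seed than `v`, even if `v` has the larger standalone
influence. No sampling is involved at any step.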

6  Experiment 

6.1  Experiment  Setup 

We  use  the  DBLP  and  Foursquare  networks  described  in  Section  3  for  the  exper¬ 
iment.  For  each  network,  we  construct  an  IC  model  by  assigning  diffusion  prob¬ 
ability  1  —  e~o  oic  edge.  A  similar  method  for  model  construction  has 

been  used  in  [4] .  For  the  DBLP  network,  c  is  the  number  of  papers  coauthored  by 
the  two  authors.  For  the  Foursquare  network,  c  is  the  number  of  locations  that 
both  users  visited.  We  do  not  construct  the  ground-truth  models  by  learning  the 
diffusion  probabilities  directly  from  the  actions  in  the  datasets  because  we  want 
to avoid the inaccuracy introduced by the model learning algorithm.
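As a rough illustration of this construction, the sketch below maps interaction counts to edge probabilities of the form $1 - e^{-\alpha c}$. The helper names, the edge-dictionary layout, and the rate parameter `alpha` (with its default value) are assumptions made for the example, not details confirmed by the experiment description.

```python
import math
from typing import Dict, Tuple

Edge = Tuple[str, str]


def edge_probability(c: int, alpha: float = 0.01) -> float:
    """Diffusion probability 1 - exp(-alpha * c); alpha is an assumed rate."""
    return 1.0 - math.exp(-alpha * c)


def build_ic_model(counts: Dict[Edge, int], alpha: float = 0.01) -> Dict[Edge, float]:
    """Turn interaction counts (e.g. coauthored papers, co-visited locations)
    into per-edge diffusion probabilities of a ground-truth IC model."""
    return {edge: edge_probability(c, alpha) for edge, c in counts.items()}


# Hypothetical counts: more shared activity gives a higher probability.
counts = {("u1", "u2"): 1, ("u1", "u3"): 20}
model = build_ic_model(counts)
```

The probability is 0 when two users share no activity and approaches 1 as the count grows, so strong ties diffuse information more reliably than weak ones.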
We then sample each ground truth IC model to get 5,000 cascades, each
with  10  seed  nodes  and  use  the  sampled  cascades  to  learn  the  CFI  model.  For 
the  baselines,  since  they  are  all  based  on  the  IC  model,  we  directly  apply  the 
influence  maximization  algorithms  on  the  ground  truth  model.  We  evaluate  the 
influence  of  seed  nodes  by  sampling  the  ground  truth  model  10,000  times  to  get 
the  average  number  of  active  nodes.  We  compare  the  following  algorithms: 
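The evaluation step, averaging the number of active nodes over many samples of the ground truth IC model, can be sketched as follows. This is a minimal sketch under assumptions: the adjacency-list format and the function names `simulate_ic` and `estimate_influence` are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, Iterable, List, Tuple

# node -> list of (neighbor, diffusion probability); assumed layout
Graph = Dict[str, List[Tuple[str, float]]]


def simulate_ic(graph: Graph, seeds: Iterable[str], rng: random.Random) -> int:
    """Run one Independent Cascade sample and return the number of active nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly active node gets one chance per inactive neighbor.
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_influence(graph: Graph, seeds: Iterable[str],
                       samples: int = 10_000, seed: int = 0) -> float:
    """Average number of active nodes over `samples` independent cascades."""
    rng = random.Random(seed)
    seed_list = list(seeds)
    return sum(simulate_ic(graph, seed_list, rng) for _ in range(samples)) / samples


# Toy graph (made up): a activates b with certainty, b never activates c.
toy = {"a": [("b", 1.0)], "b": [("c", 0.0)]}
print(estimate_influence(toy, ["a"], samples=100))  # 2.0
```

Because every sample walks the reachable part of the graph, this estimator is the expensive part of greedy selection on the IC model, which is what the CFI model avoids.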

CFIGreedy The CFI-based influence maximization algorithm with $\theta = 0.4$.

ICGreedy The greedy influence maximization of the IC model with the CELF++
optimization [11]. We take a sample size of 10,000 to estimate the influence.

Degree  The  heuristic  algorithm  that  selects  the  nodes  with  the  largest 
weighted  degree.  The  weighted  degree  of  a  node  is  the  sum  of  the  diffusion 
probabilities  over  the  out-going  edges. 

DegreeDiscount The degree discount heuristic, which builds on the degree
heuristic [4]. The basic idea is to discount the degree of users whose friends
have been selected as seed nodes.

Random  Randomly  selecting  seed  nodes. 
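The two simplest baselines in the list above, Degree and Random, can be sketched as below. The adjacency-list layout and function names are assumptions for illustration, and the sketch assumes every node appears as a key of the graph. DegreeDiscount is left out because its exact discounting rule is given in [4], not here.

```python
import random
from typing import Dict, Iterable, List, Tuple

# node -> list of (neighbor, diffusion probability); assumed layout
Graph = Dict[str, List[Tuple[str, float]]]


def weighted_degree_seeds(graph: Graph, k: int) -> List[str]:
    """Pick the k nodes with the largest sum of outgoing diffusion probabilities."""
    score = {u: sum(p for _, p in nbrs) for u, nbrs in graph.items()}
    # Sort by descending score, breaking ties by node name for determinism.
    return sorted(score, key=lambda u: (-score[u], u))[:k]


def random_seeds(nodes: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Pick k distinct seed nodes uniformly at random."""
    return random.Random(seed).sample(sorted(nodes), k)


# Toy graph (made up): weighted degrees are a = 0.9, b = 0.3, c = 0.
toy_graph = {"a": [("b", 0.5), ("c", 0.4)], "b": [("c", 0.3)], "c": []}
print(weighted_degree_seeds(toy_graph, 2))  # ['a', 'b']
```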


6.2  Results 

Effectiveness Results for Influence Maximization. First, we present the
effectiveness results of the influence maximization algorithms in terms of the
number of seed nodes. We test the effectiveness of each algorithm with an
increasing number of seed nodes. The results of the DBLP and Foursquare datasets are
illustrated in Figures 4(b) and 4(c), respectively. In each case, we illustrate the
number of seed nodes on the X-axis, and the influence of seed nodes on the
Y-axis. For the Foursquare data, CFIGreedy performs worse than ICGreedy when
the size of the seed set is small, but does better than ICGreedy when the size is
greater than 25. This is a very interesting observation. Although the CFI model
is designed to be an approximate model for the IC model, the greedy algorithm
of the CFI model does not necessarily perform worse than the greedy algorithm
of the IC model. This is because the CFI model considers the community
structure of social networks, and the consideration of community structure may favor
combinations of seed nodes that cover more communities. For the DBLP dataset,
CFIGreedy is less effective than ICGreedy. However, the difference is not very
significant, especially when we consider the fact that CFIGreedy is
significantly faster. Besides, CFIGreedy consistently outperforms DegreeDiscount,
Degree, and Random. Notice that although DegreeDiscount is a simple heuristic
method, previous work showed that it is a very effective method that nearly
matches the performance of ICGreedy [4].

Efficiency Results for Influence Maximization. We also tested the
efficiency of the influence maximization methods with a varying number of seed nodes.
The efficiency results for the DBLP and Foursquare datasets are illustrated
in Figures 5(a) and 5(b), respectively. The X-axis denotes the number of seed
nodes, whereas the Y-axis denotes the running time. Since heuristics such as
Random, Degree, and DegreeDiscount are obviously very fast, we only show the
running time of CFIGreedy and ICGreedy. As illustrated in the figures, influence
maximization based on CFIGreedy is several orders of magnitude faster than
ICGreedy with the CELF++ optimization. We also add together the time spent
on the learning of the CFI model and the running time of CFIGreedy to get a
total time for the influence maximization on the CFI model, and illustrate the
total time as “CFI(+learning time)” in the figures. Even when the learning time
is added, the total running time for the CFI model is still significantly smaller
than the running time of ICGreedy. For example, for the DBLP dataset, it takes
ICGreedy 9,079 seconds to find 60 seed nodes, while the total running time of
the CFI model is 34 seconds. Notice that, in real applications, the IC models
also need to be learned from user actions, so the learning time of the IC model
should also be added to the running time of ICGreedy.

Fig. 5. (a)-(b) Running time with different sizes of seed set on (a) DBLP and
(b) Foursquare; (c)-(d) Effects of $\theta$ on (c) influence and (d) running time

Parameter Sensitivity. Finally, we tested the sensitivity of the CFI-based
influence maximization to the clustering threshold $\theta$. Figure 5(c) shows the
influence of seed nodes selected by the CFI model with varying $\theta$. We illustrate
the value of $\theta$ on the X-axis, and the influence of seed nodes on the Y-axis.
Figure 5(d) shows the total running time of influence maximization with varying
$\theta$. We illustrate the value of $\theta$ on the X-axis, and the total running time of the
influence maximization (the running time of model learning plus the running
time of CFIGreedy) on the Y-axis. In each case, the number of seed nodes is
set to 50. We show the results on the Foursquare dataset; similar trends
are observed on the DBLP dataset. When the threshold $\theta$ decreases, the running
time increases, because the agglomerative clustering takes more steps when $\theta$ is
smaller. It is an interesting observation that the influence does not increase
monotonically as $\theta$ decreases. When $\theta$ is too large, the communities are
very small, so the CFI model omits too much indirect influence. When $\theta$ is too
small, the users in the same community are not similar enough to
each other. Both cases cause a loss of accuracy. Nevertheless, we notice that the
variation of influence is not significant. Note that the Y-axis of Figure 5(c) does
not start at 0. When $\theta$ varies from 0.3 to 1.0, the variation of influence is within ±1.5%.


7  Conclusion 

In this paper, we explore the effects of communities on information
diffusion processes. We quantitatively analyze real-world information diffusion
datasets to get insightful findings on the community effects. As an application
of these findings, we propose the CFI model, which is substantially different
from existing approximate algorithms. Experiments show that the CFI-based
influence maximization algorithm achieves effectiveness comparable to that of
influence maximization algorithms based on the IC model, but is significantly
faster. Our work sheds light on the effects of communities in the diffusion of
information, and brings a new idea to the approximation of information diffusion
processes.